-
Notifications
You must be signed in to change notification settings - Fork 16
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(low-code cdk): add lazy read to simple retriver #418
Conversation
/autofix
|
📝 WalkthroughWalkthroughThis pull request introduces functionality for lazy loading and incremental synchronization in the declarative sources. It updates the Changes
Possibly related PRs
Suggested reviewers
Would you like to proceed with these reviewers, or do you have others in mind? Tip ⚡🧪 Multi-step agentic review comment chat (experimental)
✨ Finishing Touches
🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 5
🧹 Nitpick comments (5)
airbyte_cdk/sources/declarative/retrievers/__init__.py (1)
8-13
: Consider running code formatting tools.It appears ruff formatting is requested. Perhaps running
ruff --fix
or a similar command could automatically fix style nits. wdyt?<no explicit code snippet here, just a recommendation to run format tools>airbyte_cdk/sources/declarative/partition_routers/substream_partition_router.py (1)
9-19
: Use typed mappings for clearer definitions.The
from typing import Mapping
usage is missing type parameters (e.g.Mapping[str, Any]
). Would you consider adding them for stronger type checks? wdyt?- from typing import ( ... Mapping, ... ) + from typing import ( ... Mapping, Any, ... ) def process_parent_record( self, - parent_record: Union[AirbyteMessage, Record, Mapping], + parent_record: Union[AirbyteMessage, Record, Mapping[str, Any]], parent_stream_name: str ) -> Tuple[Optional[Mapping[str, Any]], Optional[Mapping[str, Any]]]: ...airbyte_cdk/sources/declarative/retrievers/simple_retriever.py (3)
627-636
: Add type annotations for SafeResponse.The pipeline reports missing annotations on
__getattr__
,content
, andcontent.setter
. Could you add explicit types for clarity? wdyt?class SafeResponse(requests.Response): - def __getattr__(self, name): + def __getattr__(self, name: str) -> Any: return getattr(requests.Response, name, None) @property - def content(self): + def content(self) -> bytes: return super().content @content.setter - def content(self, value): + def content(self, value: Union[str, bytes]) -> None: self._content = value.encode() if isinstance(value, str) else value🧰 Tools
🪛 GitHub Actions: Linters
[error] 628-628: Function is missing a type annotation [no-untyped-def]
[error] 632-632: Function is missing a return type annotation [no-untyped-def]
[error] 636-636: Function is missing a type annotation [no-untyped-def]
679-691
: Use typedMapping[str, Any]
consistently.For
_extract_child_records
and similar methods, adding[str, Any]
clarifies usage. Consider ensuring dpath usage references correct mutable or immutable types. wdyt?-def _extract_child_records(self, parent_record: Mapping) -> Mapping: +def _extract_child_records(self, parent_record: Mapping[str, Any]) -> Mapping[str, Any]: ...🧰 Tools
🪛 GitHub Actions: Linters
[error] 679-679: Missing type parameters for generic type 'Mapping' [type-arg]
[error] 686-686: Incompatible return value type (got 'Any | object', expected 'Mapping[Any, Any]') [return-value]
[error] 686-686: Argument 1 to 'values' has incompatible type 'Mapping[Any, Any]'; expected 'MutableMapping[Any, Any]' [arg-type]
[error] 688-688: Argument 1 to 'get' has incompatible type 'Mapping[Any, Any]'; expected 'MutableMapping[Any, Any]' [arg-type]
[error] 691-691: Missing type parameters for generic type 'Mapping' [type-arg]
738-738
: Avoid passing Record directly into dpath.get.
dpath.get
expects a (mutable) mapping. For aRecord
, perhaps fetch.data
first to preserve consistent usage. wdyt?- partition_value = dpath.get(parent_record, ...) + partition_value = dpath.get(parent_record.data, ...)🧰 Tools
🪛 GitHub Actions: Linters
[error] 738-738: Argument 1 to 'get' has incompatible type 'Record'; expected 'MutableMapping[Any, Any]' [arg-type]
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py
(3 hunks)airbyte_cdk/sources/declarative/partition_routers/substream_partition_router.py
(3 hunks)airbyte_cdk/sources/declarative/retrievers/__init__.py
(1 hunks)airbyte_cdk/sources/declarative/retrievers/simple_retriever.py
(3 hunks)
🧰 Additional context used
🪛 GitHub Actions: Linters
airbyte_cdk/sources/declarative/retrievers/__init__.py
[error] 1-1: Ruff formatting check failed. 1 file would be reformatted. Please run 'ruff format' to fix code style issues in this file.
airbyte_cdk/sources/declarative/partition_routers/substream_partition_router.py
[error] 147-148: Missing type parameters for generic type 'Mapping' [type-arg]
[error] 164-164: Item 'None' of 'AirbyteFileTransferRecordMessage | Any | None' has no attribute 'data' [union-attr]
[error] 215-215: Incompatible types in assignment (expression has type 'Mapping[Any, Any] | None', variable has type 'Mapping[str, Any] | AirbyteMessage') [assignment]
airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py
[error] 2714-2714: 'SimpleRetriever' has no attribute 'lazy_read_pointer' [attr-defined]
[error] 2719-2719: 'SimpleRetriever' has no attribute 'lazy_read_pointer' [attr-defined]
[error] 2722-2722: Argument 'model' to '_create_component_from_model' of 'ModelToComponentFactory' has incompatible type 'CustomPartitionRouter | ListPartitionRouter | SubstreamPartitionRouter | list[CustomPartitionRouter | ListPartitionRouter | SubstreamPartitionRouter] | None'; expected 'BaseModel' [arg-type]
airbyte_cdk/sources/declarative/retrievers/simple_retriever.py
[error] 628-628: Function is missing a type annotation [no-untyped-def]
[error] 632-632: Function is missing a return type annotation [no-untyped-def]
[error] 636-636: Function is missing a type annotation [no-untyped-def]
[error] 646-646: Incompatible types in assignment (expression has type 'None', variable has type 'SubstreamPartitionRouter') [assignment]
[error] 659-659: Incompatible types in assignment (expression has type 'Mapping[Any, Any] | None', variable has type 'Mapping[str, Any] | AirbyteMessage') [assignment]
[error] 665-665: Argument 1 to '_extract_child_records' of 'LazySimpleRetriever' has incompatible type 'Mapping[str, Any] | AirbyteMessage'; expected 'Mapping[Any, Any]' [arg-type]
[error] 673-673: Argument 5 to '_yield_records_with_pagination' of 'LazySimpleRetriever' has incompatible type 'Mapping[str, Any] | AirbyteMessage'; expected 'Record' [arg-type]
[error] 679-679: Missing type parameters for generic type 'Mapping' [type-arg]
[error] 686-686: Incompatible return value type (got 'Any | object', expected 'Mapping[Any, Any]') [return-value]
[error] 686-686: Argument 1 to 'values' has incompatible type 'Mapping[Any, Any]'; expected 'MutableMapping[Any, Any]' [arg-type]
[error] 688-688: Argument 1 to 'get' has incompatible type 'Mapping[Any, Any]'; expected 'MutableMapping[Any, Any]' [arg-type]
[error] 691-691: Missing type parameters for generic type 'Mapping' [type-arg]
[error] 738-738: Argument 1 to 'get' has incompatible type 'Record'; expected 'MutableMapping[Any, Any]' [arg-type]
[error] 758-758: Argument 1 to '_next_page_token' of 'SimpleRetriever' has incompatible type 'Response | None'; expected 'Response' [arg-type]
⏰ Context from checks skipped due to timeout of 90000ms (6)
- GitHub Check: Check: 'source-pokeapi' (skip=false)
- GitHub Check: Check: 'source-amplitude' (skip=false)
- GitHub Check: Check: 'source-shopify' (skip=false)
- GitHub Check: Check: 'source-hardcoded-records' (skip=false)
- GitHub Check: Pytest (All, Python 3.11, Ubuntu)
- GitHub Check: Pytest (All, Python 3.10, Ubuntu)
🔇 Additional comments (3)
airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (1)
2722-2722
:❓ Verification inconclusive
Ensure
model.partition_router
is a valid BaseModel.When calling
_create_component_from_model(model=model.partition_router, ...)
, you may be passing a non-BaseModel object (e.g.,SubstreamPartitionRouter
). Could you wrap or convert it into a Pydantic model before this call? wdyt?
🏁 Script executed:
#!/bin/bash # Searching references to partition_router usage for create_simple_retriever rg -A 3 'partition_router.*create_simple_retriever'Length of output: 53
Verify Pydantic Conversion for
partition_router
At line 2722 in
airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py
, we're passingmodel.partition_router
to_create_component_from_model
. The concern is thatmodel.partition_router
might not be a valid Pydantic BaseModel (it could be a non-BaseModel likeSubstreamPartitionRouter
). Our initial repository search didn’t reveal references linkingpartition_router
with methods likecreate_simple_retriever
, so it’s a bit unclear whether this conversion is already handled elsewhere.Could you please manually verify if:
model.partition_router
is indeed an instance of a Pydantic BaseModel, or- It requires wrapping/conversion (e.g., via a proper Pydantic model) before being passed to
_create_component_from_model
?Wdyt?
🧰 Tools
🪛 GitHub Actions: Linters
[error] 2722-2722: Argument 'model' to '_create_component_from_model' of 'ModelToComponentFactory' has incompatible type 'CustomPartitionRouter | ListPartitionRouter | SubstreamPartitionRouter | list[CustomPartitionRouter | ListPartitionRouter | SubstreamPartitionRouter] | None'; expected 'BaseModel' [arg-type]
airbyte_cdk/sources/declarative/retrievers/simple_retriever.py (2)
686-689
: Handle dpath return more robustly.
dpath.values(...)
might return a list, not a dict. That conflicts with the method signature returningMapping[str, Any]
. Would you unify the return type or adjust usage? wdyt?🧰 Tools
🪛 GitHub Actions: Linters
[error] 686-686: Incompatible return value type (got 'Any | object', expected 'Mapping[Any, Any]') [return-value]
[error] 686-686: Argument 1 to 'values' has incompatible type 'Mapping[Any, Any]'; expected 'MutableMapping[Any, Any]' [arg-type]
[error] 688-688: Argument 1 to 'get' has incompatible type 'Mapping[Any, Any]'; expected 'MutableMapping[Any, Any]' [arg-type]
673-673
:❓ Verification inconclusive
Consolidate parent record type to 'Record'.
This call passes a union type to
_yield_records_with_pagination()
where a pureRecord
is expected. Would you cast the value or unify the type to always be aRecord
? wdyt?
🏁 Script executed:
#!/bin/bash # Searching call sites for `_yield_records_with_pagination` rg '_yield_records_with_pagination\('Length of output: 265
Consolidate parent record type to a pure Record
It looks like theparent_record
passed on line 673 ofairbyte_cdk/sources/declarative/retrievers/simple_retriever.py
carries a union type, but_yield_records_with_pagination()
expects aRecord
. Would you consider either explicitly casting the value or unifying the type so that it is always aRecord
? wdyt?🧰 Tools
🪛 GitHub Actions: Linters
[error] 673-673: Argument 5 to '_yield_records_with_pagination' of 'LazySimpleRetriever' has incompatible type 'Mapping[str, Any] | AirbyteMessage'; expected 'Record' [arg-type]
airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py
Outdated
Show resolved
Hide resolved
airbyte_cdk/sources/declarative/partition_routers/substream_partition_router.py
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (1)
airbyte_cdk/sources/declarative/declarative_component_schema.yaml (1)
3212-3219
: Suggestion: Clarify the "items" schema definition for lazy_read_pointerThe new lazy_read_pointer property looks great for enabling lazy reading! One thought, though – since its type is “array” and you expect each element to be a string, would you consider defining the items schema as an inline object instead of a list with a single element? For example, using:
items: type: stringinstead of:
items: - type: stringThis might improve clarity and ensure consistency with how other array items are defined in the schema. wdyt?
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
airbyte_cdk/sources/declarative/declarative_component_schema.yaml
(1 hunks)airbyte_cdk/sources/declarative/models/declarative_component_schema.py
(10 hunks)
✅ Files skipped from review due to trivial changes (1)
- airbyte_cdk/sources/declarative/models/declarative_component_schema.py
⏰ Context from checks skipped due to timeout of 90000ms (8)
- GitHub Check: Check: 'source-pokeapi' (skip=false)
- GitHub Check: Check: 'source-amplitude' (skip=false)
- GitHub Check: Check: 'source-shopify' (skip=false)
- GitHub Check: Check: 'source-hardcoded-records' (skip=false)
- GitHub Check: Pytest (All, Python 3.11, Ubuntu)
- GitHub Check: Pytest (Fast)
- GitHub Check: Pytest (All, Python 3.10, Ubuntu)
- GitHub Check: SDM Docker Image Build
/autofix
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (3)
airbyte_cdk/sources/declarative/partition_routers/substream_partition_router.py (1)
146-180
: Consider a debug log when skipping invalid or non-record data.
Currently, the code returnsNone, None
for invalid data. Would you find it helpful to log the reason for skipping as well, to aid debugging? wdyt?airbyte_cdk/sources/declarative/retrievers/simple_retriever.py (2)
640-651
: Double-check SafeResponse content encoding logic.
Could there be edge cases where the encoding differs from UTF-8, and do we want to handle them here? wdyt?
657-787
: Consider validating the parent record structure before lazy extraction.
Would an additional check for data shape or type help catch potential mismatches before calling_extract_child_records
? wdyt?
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
airbyte_cdk/sources/declarative/models/declarative_component_schema.py
(1 hunks)airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py
(3 hunks)airbyte_cdk/sources/declarative/partition_routers/substream_partition_router.py
(3 hunks)airbyte_cdk/sources/declarative/retrievers/__init__.py
(1 hunks)airbyte_cdk/sources/declarative/retrievers/simple_retriever.py
(4 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
- airbyte_cdk/sources/declarative/models/declarative_component_schema.py
- airbyte_cdk/sources/declarative/retrievers/init.py
🧰 Additional context used
🪛 GitHub Actions: Linters
airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py
[error] 2734-2734: Argument 'model' to '_create_component_from_model' of 'ModelToComponentFactory' has incompatible type 'CustomPartitionRouter | ListPartitionRouter | SubstreamPartitionRouter | list[CustomPartitionRouter | ListPartitionRouter | SubstreamPartitionRouter] | None'; expected 'BaseModel'.
⏰ Context from checks skipped due to timeout of 90000ms (8)
- GitHub Check: Check: 'source-pokeapi' (skip=false)
- GitHub Check: Check: 'source-amplitude' (skip=false)
- GitHub Check: Check: 'source-shopify' (skip=false)
- GitHub Check: Check: 'source-hardcoded-records' (skip=false)
- GitHub Check: Pytest (All, Python 3.11, Ubuntu)
- GitHub Check: SDM Docker Image Build
- GitHub Check: Pytest (All, Python 3.10, Ubuntu)
- GitHub Check: Pytest (Fast)
🔇 Additional comments (5)
airbyte_cdk/sources/declarative/partition_routers/substream_partition_router.py (1)
213-221
: Verify if large parent partitions might degrade performance.
When iterating over many parent records, do you think we need a checkpoint or limit to avoid potential memory issues? wdyt?airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (2)
438-438
: No concerns with referencing LazySimpleRetriever.
Implementation looks clean. wdyt?
2651-2657
: Check type safety for 'incremental_sync' parameter.
The code uses a union of advanced cursor classes, and the pipeline logs mention a type mismatch if passed to_create_component_from_model()
. Would it make sense to handle them separately or ensure they implementBaseModel
? wdyt?airbyte_cdk/sources/declarative/retrievers/simple_retriever.py (2)
9-24
: No issues with the new import statements.
Everything appears consistent. wdyt?
653-656
: Deprecation annotation looks good.
This approach cleanly warns users of experimental status. wdyt?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
♻️ Duplicate comments (1)
airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (1)
2720-2727
:⚠️ Potential issueAdditional type checking needed for lazy_read_pointer
Similar to the previous issue, we should check if
lazy_read_pointer
exists as an attribute on the model before trying to access it. The GitHub Actions linter flag shows the same issue.Consider updating the condition to:
- if model.lazy_read_pointer and not bool( + if hasattr(model, "lazy_read_pointer") and model.lazy_read_pointer and not bool( self._connector_state_manager.get_stream_state(name, None) ):
🧹 Nitpick comments (1)
airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (1)
2722-2722
: Type annotation issue when passing partition_router modelThe comment
# type: ignore[arg-type] # model.partition_router has BaseModel type
indicates awareness of the type issue, but could be addressed more cleanly.Perhaps consider creating a specific model type for lazy retriever that properly inherits or composes with
SubstreamPartitionRouterModel
to make type checking work better? This would also address previous review comments about attribute errors.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py
(3 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (9)
- GitHub Check: Check: 'source-pokeapi' (skip=false)
- GitHub Check: Check: 'source-amplitude' (skip=false)
- GitHub Check: Check: 'source-shopify' (skip=false)
- GitHub Check: Check: 'source-hardcoded-records' (skip=false)
- GitHub Check: Pytest (All, Python 3.11, Ubuntu)
- GitHub Check: Pytest (All, Python 3.10, Ubuntu)
- GitHub Check: Pytest (Fast)
- GitHub Check: SDM Docker Image Build
- GitHub Check: Analyze (python)
🔇 Additional comments (3)
airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (3)
438-438
: Import of LazySimpleRetriever addedAdding this import to support the new lazy read functionality. Looks good!
2651-2655
: Method signature extension looks good!The method now supports optional incremental sync parameters for the new lazy reading functionality.
2728-2758
: Implementation of LazySimpleRetriever instantiation looks well structuredThe code correctly:
- Creates lazy_read_pointer from interpolated strings
- Creates the partition_router from the model
- Sets up the stream_slicer based on incremental_sync
- Returns properly configured LazySimpleRetriever with all required parameters
However, be aware that type annotations in lines 2734-2735 show type issues with
model.partition_router
. It might work at runtime but could cause type checking problems.
airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (2)
airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (2)
2747-2754
: Should we clarify the lazy loading behavior in a comment?The condition checks both for
lazy_read_pointer
and the absence of stream state. This suggests lazy loading is only used when starting from scratch - might be worth adding a clarifying comment explaining this intentional behavior, wdyt?- if model.lazy_read_pointer and not bool( - self._connector_state_manager.get_stream_state(name, None) - ): + # Only use lazy loading when starting from scratch (no existing state) + # This ensures we don't skip already processed records when resuming + if model.lazy_read_pointer and not bool( + self._connector_state_manager.get_stream_state(name, None) + ):
2756-2760
: Consider adding a type annotation for lazy_read_pointer.For improved code clarity, you might want to add a type annotation to make it clear that
lazy_read_pointer
is expected to be a list of strings that get converted toInterpolatedString
objects, wdyt?+ # Convert each path in lazy_read_pointer to an InterpolatedString lazy_read_pointer = [ InterpolatedString.create(path, parameters=model.parameters or {}) for path in model.lazy_read_pointer ]
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
airbyte_cdk/sources/declarative/declarative_component_schema.yaml
(1 hunks)airbyte_cdk/sources/declarative/models/declarative_component_schema.py
(1 hunks)airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py
(3 hunks)unit_tests/sources/declarative/retrievers/test_lazy_simple_retriever.py
(1 hunks)
✅ Files skipped from review due to trivial changes (1)
- unit_tests/sources/declarative/retrievers/test_lazy_simple_retriever.py
🚧 Files skipped from review as they are similar to previous changes (2)
- airbyte_cdk/sources/declarative/models/declarative_component_schema.py
- airbyte_cdk/sources/declarative/declarative_component_schema.yaml
⏰ Context from checks skipped due to timeout of 90000ms (8)
- GitHub Check: Check: 'source-pokeapi' (skip=false)
- GitHub Check: Check: 'source-amplitude' (skip=false)
- GitHub Check: Check: 'source-shopify' (skip=false)
- GitHub Check: Check: 'source-hardcoded-records' (skip=false)
- GitHub Check: Pytest (All, Python 3.11, Ubuntu)
- GitHub Check: Pytest (Fast)
- GitHub Check: SDM Docker Image Build
- GitHub Check: Pytest (All, Python 3.10, Ubuntu)
🔇 Additional comments (5)
airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (5)
441-441
: Correctly added import for LazySimpleRetriever.Good job adding the import along with the other retrievers. This follows the project's import grouping pattern.
2678-2682
: New parameter matches the intended functionality.The new optional
incremental_sync
parameter is well-typed and correctly extends the method signature to support the lazy read feature.
2741-2746
: Proper validation for required partition_router.Good job implementing this validation to ensure a partition_router is defined when lazy_read_pointer is set. This prevents potential runtime errors.
2764-2768
: Handling of incremental_sync is appropriate.The logic correctly creates a stream slicer from the incremental_sync model if provided, otherwise falls back to a SinglePartitionRouter. This ensures proper slice management for both incremental and non-incremental scenarios.
2770-2784
: LazySimpleRetriever instantiation looks good.The instantiation passes all required parameters to the LazySimpleRetriever, including the partition_router and lazy_read_pointer. It reuses the same pattern as SimpleRetriever instantiation elsewhere in the code, which maintains consistency.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🔭 Outside diff range comments (1)
unit_tests/sources/declarative/retrievers/test_lazy_simple_retriever.py (1)
1-339
:⚠️ Potential issueFix formatting issues highlighted by Ruff.
There's a pipeline failure indicating that Ruff formatting check failed for this file. This will need to be fixed before merging.
#!/bin/bash # Format the file according to Ruff standards ruff format unit_tests/sources/declarative/retrievers/test_lazy_simple_retriever.py🧰 Tools
🪛 GitHub Actions: Linters
[error] 1-1: Ruff formatting check failed. 1 file would be reformatted. Run 'ruff format' to fix code style issues in this file.
🧹 Nitpick comments (1)
unit_tests/sources/declarative/retrievers/test_lazy_simple_retriever.py (1)
1-4
: Consider updating the copyright year.The copyright year is set to 2025, but the current year is 2024. Should this be updated to reflect the actual year? wdyt?
-# Copyright (c) 2025 Airbyte, Inc., all rights reserved. +# Copyright (c) 2024 Airbyte, Inc., all rights reserved.🧰 Tools
🪛 GitHub Actions: Linters
[error] 1-1: Ruff formatting check failed. 1 file would be reformatted. Run 'ruff format' to fix code style issues in this file.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py
(5 hunks)airbyte_cdk/sources/declarative/requesters/paginators/strategies/cursor_pagination_strategy.py
(0 hunks)unit_tests/sources/declarative/retrievers/test_lazy_simple_retriever.py
(1 hunks)
💤 Files with no reviewable changes (1)
- airbyte_cdk/sources/declarative/requesters/paginators/strategies/cursor_pagination_strategy.py
🧰 Additional context used
🪛 GitHub Actions: Linters
unit_tests/sources/declarative/retrievers/test_lazy_simple_retriever.py
[error] 1-1: Ruff formatting check failed. 1 file would be reformatted. Run 'ruff format' to fix code style issues in this file.
⏰ Context from checks skipped due to timeout of 90000ms (8)
- GitHub Check: Check: 'source-pokeapi' (skip=false)
- GitHub Check: Check: 'source-shopify' (skip=false)
- GitHub Check: Check: 'source-hardcoded-records' (skip=false)
- GitHub Check: Pytest (All, Python 3.11, Ubuntu)
- GitHub Check: Pytest (All, Python 3.10, Ubuntu)
- GitHub Check: Pytest (Fast)
- GitHub Check: SDM Docker Image Build
- GitHub Check: Analyze (python)
🔇 Additional comments (11)
airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (6)
441-441
: Nice addition of LazySimpleRetriever import.Good update to include the new LazySimpleRetriever in the imports list. This makes it available for use in the factory.
2680-2685
: Good type annotation for the incremental_sync parameter.The incremental_sync parameter is well-typed with a clear Union type including all the necessary cursor model types. Adding this parameter makes the create_simple_retriever method more flexible.
2743-2748
: Good error handling for lazy_read_pointer without partition_router.This check ensures that when lazy_read_pointer is set, a partition_router is also defined, which is a requirement for the LazySimpleRetriever. The error message clearly explains what's needed to resolve the issue.
2749-2756
: Good validation for SubstreamPartitionRouter requirement.The type check ensures that lazy_read_pointer can only be used with a SubstreamPartitionRouter, preventing potential runtime errors with other router types. The error message is descriptive and helpful.
2758-2771
: Clean implementation of lazy_read_pointer preparation.The code properly converts string paths to InterpolatedString objects with appropriate parameters, and correctly sets up the partition router and stream slicer for the LazySimpleRetriever.
2772-2787
: Well-structured creation of LazySimpleRetriever.The initialization of LazySimpleRetriever with all required parameters is clean and maintains consistency with the SimpleRetriever constructor pattern. All necessary components are correctly passed.
unit_tests/sources/declarative/retrievers/test_lazy_simple_retriever.py (5)
26-155
: Well-structured test configuration and manifest.The test configuration and manifest provide a comprehensive setup for testing the lazy loading functionality. The manifest includes all necessary components: streams, schema loaders, retriever configurations, and incremental sync settings.
158-171
: Helpful utility functions for test configuration.These utility functions make the tests more readable by abstracting away the creation of ConfiguredAirbyteStream and ConfiguredAirbyteCatalog objects. Good use of default parameters in to_configured_stream().
Also applies to: 174-177
180-192
: Well-documented helper functions for test execution.The create_configured_catalog and get_records functions have clear docstrings explaining their purpose. The implementation of get_records is particularly clean, using list comprehension to filter for RECORD type messages.
Also applies to: 194-208
211-271
: Comprehensive test for lazy loading retrieval.This test effectively validates the lazy loading behavior with paginated substream data. The HTTP mocking setup simulates realistic API responses with pagination, and the assertions verify that all expected records are retrieved.
273-338
: Good test for incremental sync with state.The test for incremental sync properly verifies that only new records are fetched when state is provided. The state setup and HTTP mocking are well implemented to test this scenario.
/autofix
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (4)
unit_tests/sources/declarative/retrievers/test_lazy_simple_retriever.py (4)
194-208
: Function looks good, but could handle edge cases betterThis helper function works well for the happy path, but would it be worth adding some error handling for cases where messages have unexpected formats? Perhaps adding a try/except block could make it more robust for future tests? wdyt?
def get_records( source: ConcurrentDeclarativeSource, config: dict, catalog: ConfiguredAirbyteCatalog, state: list = None, ) -> list: """ Reads records from the source given a configuration, catalog, and optional state. Returns a list of record data dictionaries. """ return [ message.record.data for message in source.read(logger=MagicMock(), config=config, catalog=catalog, state=state) - if message.type == Type.RECORD + if message.type == Type.RECORD and hasattr(message, 'record') and hasattr(message.record, 'data') ]
211-271
: Test looks thorough, but could verify more aspectsGreat test for lazy loading! It correctly verifies that records from parent streams and paginated child streams are combined. Could we enhance it by also verifying that requests for child data are only made when needed (the true "lazy" part)? Perhaps using a counter to check the number of requests made?
Also, the assertion only checks final results - would it be helpful to assert the pagination process more explicitly? wdyt?
273-340
: Incremental sync test looks well-structuredGood test coverage of incremental sync behavior! It correctly tests that only records matching the state's cursor are retrieved. I like that you're setting up a realistic state object with the proper Airbyte model structure.
One suggestion: consider testing an edge case where no new records are available after the state cursor to verify that an empty result is handled correctly? wdyt?
214-270
: Consider adding comments to clarify the multi-step test flowThe test setup with multiple HTTP mock responses is quite complex. Would it be helpful to add more comments explaining what each response is simulating? For example, explaining that the first response contains parent data and some initial child data, while subsequent responses represent pagination of additional child records. This would make the test easier to understand for future developers. wdyt?
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
unit_tests/sources/declarative/retrievers/test_lazy_simple_retriever.py
(1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (9)
- GitHub Check: Check: 'source-pokeapi' (skip=false)
- GitHub Check: Check: 'source-amplitude' (skip=false)
- GitHub Check: Check: 'source-shopify' (skip=false)
- GitHub Check: Check: 'source-hardcoded-records' (skip=false)
- GitHub Check: Pytest (All, Python 3.11, Ubuntu)
- GitHub Check: Pytest (All, Python 3.10, Ubuntu)
- GitHub Check: Pytest (Fast)
- GitHub Check: SDM Docker Image Build
- GitHub Check: Analyze (python)
🔇 Additional comments (2)
unit_tests/sources/declarative/retrievers/test_lazy_simple_retriever.py (2)
1-2
: Copyright year is set to 2025I notice the copyright year is set to 2025, but it's currently 2024. Is this intentional or should it be updated to the current year? wdyt?
46-46
: Good implementation of lazy_read_pointerThe
lazy_read_pointer
configuration looks good! This seems to be the key feature being tested in this file. The pointer being set to["items"]
correctly targets the nested data structure in the parent response.
/autofix
|
/autofix
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (4)
airbyte_cdk/sources/declarative/partition_routers/substream_partition_router.py (3)
4-5
: Import order should be alphabetized.The added imports are not following alphabetical order, which is a common Python convention. Would you consider reorganizing the imports?
- import copy - import json + import copy + import json
232-246
: Type annotation could be more specific in _extract_child_response.The return annotation uses
requests.Response
but the actual return value isSafeResponse
. Would you consider making this more precise?- def _extract_child_response( - self, parent_record: Mapping[str, Any] | AirbyteMessage, pointer: List[InterpolatedString] - ) -> requests.Response: + def _extract_child_response( + self, parent_record: Mapping[str, Any] | AirbyteMessage, pointer: List[InterpolatedString] + ) -> SafeResponse:
245-245
: Small typo in the comment.There's a typo in the type ignore comment ("argunet" instead of "argument").
- return _create_response(dpath.get(parent_record, path, default=[])) # type: ignore # argunet will be a MutableMapping, given input data structure + return _create_response(dpath.get(parent_record, path, default=[])) # type: ignore # argument will be a MutableMapping, given input data structureairbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (1)
2770-2770
: Small typo in error message.There's a typo in the error message ("more that one" instead of "more than one").
- raise ValueError( - f"Found more that one slice per parent. LazySimpleRetriever only supports single slice read for stream - {name}." - ) + raise ValueError( + f"Found more than one slice per parent. LazySimpleRetriever only supports single slice read for stream - {name}." + )
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py
(7 hunks)airbyte_cdk/sources/declarative/partition_routers/substream_partition_router.py
(6 hunks)
🧰 Additional context used
🪛 GitHub Actions: Linters
airbyte_cdk/sources/declarative/partition_routers/substream_partition_router.py
[error] 1-1: Ruff formatting check failed. 1 file would be reformatted. Run 'ruff format' to fix code style issues in this file.
⏰ Context from checks skipped due to timeout of 90000ms (9)
- GitHub Check: Check: 'source-pokeapi' (skip=false)
- GitHub Check: Check: 'source-amplitude' (skip=false)
- GitHub Check: Check: 'source-shopify' (skip=false)
- GitHub Check: Check: 'source-hardcoded-records' (skip=false)
- GitHub Check: Pytest (All, Python 3.11, Ubuntu)
- GitHub Check: Pytest (Fast)
- GitHub Check: SDM Docker Image Build
- GitHub Check: Pytest (All, Python 3.10, Ubuntu)
- GitHub Check: Analyze (python)
🔇 Additional comments (9)
airbyte_cdk/sources/declarative/partition_routers/substream_partition_router.py (4)
53-53
: Good addition of the lazy_read_pointer field.Adding this field to ParentStreamConfig creates a clear way to enable lazy reading functionality, which will help optimize performance for streams with limited child data exposure like Stripe's bank_accounts.
67-76
: Well-implemented field initialization.The initialization gracefully handles conversions from string to InterpolatedString, ensuring consistent behavior.
215-222
: Clean integration of lazy reading with existing code.The conditional check and insertion of child_response into extracted_extra_fields is a non-intrusive way to implement the feature. This maintains backward compatibility while adding new functionality.
425-435
: Well-designed SafeResponse class.This extension of
requests.Response
handles content properly, ensuring that content is always encoded correctly whether provided as a string or bytes. It also provides a fallback with__getattr__
for any missing attributes.airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (5)
441-441
: Good addition of LazySimpleRetriever import.The import of LazySimpleRetriever enables the factory to create instances of this new class.
1691-1695
: Great extension of the create_simple_retriever method.Adding the
incremental_sync
parameter allows for more flexible configuration of retrievers. The union type properly accommodates the various cursor models that might be used.
2532-2534
: Useful validation for lazy_read_pointer wildcards.This validation prevents misuse of the feature by ensuring wildcards aren't used, which could lead to unexpected behavior or performance issues.
2537-2539
: Clean initialization of model_lazy_read_pointer.This approach ensures the field is never None but rather an empty list when not provided, making subsequent code more consistent.
2754-2792
: Comprehensive condition checking for LazySimpleRetriever.The implementation properly checks all necessary conditions before using the LazySimpleRetriever - router type, stream state, lazy_read_pointer presence, and incompatible features.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (3)
airbyte_cdk/sources/declarative/partition_routers/substream_partition_router.py (3)
233-247
: Enhance error handling in_extract_child_response
The method relies on
dpath.get()
with a default empty list, which handles missing paths gracefully. However, consider adding more explicit error handling for other potential issues like malformed data structures that might arise when extracting data from external sources.Would adding debug logging for successful extraction (similar to what's done in
_extract_extra_fields
) be helpful here? wdyt?def _extract_child_response( self, parent_record: Mapping[str, Any] | AirbyteMessage, pointer: List[InterpolatedString] ) -> requests.Response: """Extract child records from a parent record based on lazy pointers.""" def _create_response(data: MutableMapping[str, Any]) -> SafeResponse: """Create a SafeResponse with the given data.""" response = SafeResponse() response.content = json.dumps(data).encode("utf-8") response.status_code = 200 return response path = [path.eval(self.config) for path in pointer] - return _create_response(dpath.get(parent_record, path, default=[])) # type: ignore # argunet will be a MutableMapping, given input data structure + extracted_data = dpath.get(parent_record, path, default=[]) # type: ignore # argument will be a MutableMapping, given input data structure + self.logger.debug(f"Extracted child response from path: {path} with {len(extracted_data) if isinstance(extracted_data, list) else 'non-list'} data") + return _create_response(extracted_data)
426-437
: Consider making SafeResponse more robustThe
SafeResponse
class extendsrequests.Response
with a custom getter and setter for content. This is a clever approach to reuse the Response object structure.However, the current
__getattr__
implementation returnsNone
for any non-existent attribute rather than raising AttributeError. This might lead to subtle bugs where code continues silently with None values. Would following Python's attribute lookup pattern more closely be better? wdyt?class SafeResponse(requests.Response): def __getattr__(self, name: str) -> Any: - return getattr(requests.Response, name, None) + # This follows Python's standard attribute lookup pattern + try: + return getattr(super(), name) + except AttributeError: + raise AttributeError(f"'{self.__class__.__name__}' object has no attribute '{name}'") @property def content(self) -> Optional[bytes]: return super().content @content.setter def content(self, value: Union[str, bytes]) -> None: self._content = value.encode() if isinstance(value, str) else value
246-246
: Fix typo in type ignore commentThere's a small typo in your type ignore comment - "argunet" should be "argument".
- return _create_response(dpath.get(parent_record, path, default=[])) # type: ignore # argunet will be a MutableMapping, given input data structure + return _create_response(dpath.get(parent_record, path, default=[])) # type: ignore # argument will be a MutableMapping, given input data structure
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
airbyte_cdk/sources/declarative/partition_routers/substream_partition_router.py
(6 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (7)
- GitHub Check: Check: 'source-pokeapi' (skip=false)
- GitHub Check: Check: 'source-amplitude' (skip=false)
- GitHub Check: Check: 'source-shopify' (skip=false)
- GitHub Check: Check: 'source-hardcoded-records' (skip=false)
- GitHub Check: Pytest (All, Python 3.11, Ubuntu)
- GitHub Check: Pytest (All, Python 3.10, Ubuntu)
- GitHub Check: SDM Docker Image Build
🔇 Additional comments (3)
airbyte_cdk/sources/declarative/partition_routers/substream_partition_router.py (3)
53-53
: Good addition oflazy_read_pointer
to ParentStreamConfigThe new optional attribute for storing lazy read pointers aligns nicely with the existing pattern of configuration parameters. This will enable lazy loading of child records from parent records.
67-76
: LGTM! Well-structured initialization handlingThis initialization logic for
lazy_read_pointer
follows the same pattern as used for other interpolated string fields. The conditional handling ensures we don't process None values unnecessarily.
215-223
:❓ Verification inconclusive
Consider potential performance implications
The lazy read feature looks well implemented. When enabled, it extracts child response data from the parent record and includes it in the extra fields.
I wonder if there might be potential performance implications for large dataset extraction? Have you considered any memory usage analysis for extracting large nested datasets? wdyt?
🏁 Script executed:
#!/bin/bash # Description: Check if there are any large nested structures in the codebase that might cause memory issues # Look for deeply nested data structures in test files that might be used with this feature rg -B 2 -A 2 --type json --type yaml "lazy_read_pointer" # Check if there are any performance tests involving large datasets rg -A 5 -i "performance|benchmark|large dataset|memory" --type python "test_.*retriever|test_.*router"Length of output: 763
Revisit Potential Performance Impact for Lazy Read
The lazy read functionality in
airbyte_cdk/sources/declarative/partition_routers/substream_partition_router.py
(lines 215–223) is cleanly implemented and successfully integrates child responses into the extra fields. However, since this feature could be used to process large nested datasets, have you considered whether the current implementation might introduce memory or performance bottlenecks under heavy load? Would you be open to exploring additional profiling or incorporating benchmark tests to validate its behavior in such scenarios? wdyt?
…m:airbytehq/airbyte-python-cdk into lazebnyi/add-lazy-read-to-simple-retriver
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (1)
airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (1)
2764-2773
: Fixed a tiny typo in the error message.There's a small typo in the error message - "more that one" should be "more than one".
- raise ValueError(f"Found more that one slice per parent. LazySimpleRetriever only supports single slice read for stream - {name}.") + raise ValueError(f"Found more than one slice per parent. LazySimpleRetriever only supports single slice read for stream - {name}.")Subtle but worth fixing, wdyt?
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py
(7 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (9)
- GitHub Check: Check: 'source-pokeapi' (skip=false)
- GitHub Check: Check: 'source-amplitude' (skip=false)
- GitHub Check: Check: 'source-shopify' (skip=false)
- GitHub Check: Check: 'source-hardcoded-records' (skip=false)
- GitHub Check: Pytest (All, Python 3.11, Ubuntu)
- GitHub Check: Pytest (All, Python 3.10, Ubuntu)
- GitHub Check: SDM Docker Image Build
- GitHub Check: Pytest (Fast)
- GitHub Check: Analyze (python)
🔇 Additional comments (8)
airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (8)
441-441
: LGTM! Added import for the new LazySimpleRetriever class.This correctly adds the import for the new
LazySimpleRetriever
class that implements the lazy read functionality mentioned in the PR description.
1692-1695
: Nice addition of incremental_sync parameter to create_simple_retriever.Adding this parameter allows passing the incremental sync configuration to the retriever, which is needed for proper initialization of the lazy read functionality. The typing looks correct with proper Union of all supported cursor models.
2532-2539
: Good validation for wildcard characters in lazy_read_pointer.This check properly ensures that wildcard characters aren't supported in the lazy_read_pointer paths, with a clear error message explaining the limitation.
2537-2540
: Well-formed preparation of the lazy_read_pointer list.The conversion to a properly typed list ensures compatibility with the expected parameter type in ParentStreamConfig.
2550-2550
: LGTM! Added lazy_read_pointer to ParentStreamConfig constructor.This correctly passes the prepared lazy_read_pointer to the ParentStreamConfig constructor.
2754-2763
: Great implementation of the LazySimpleRetriever conditional logic.This checks all necessary conditions before deciding to use LazySimpleRetriever:
- Confirms it's a SubstreamPartitionRouter
- Verifies we don't have existing stream state
- Ensures at least one parent_stream_config has lazy_read_pointer set
All of these are needed to properly enable the lazy read functionality.
2774-2778
: Good decoder type check for LazySimpleRetriever.This validation ensures that only JsonDecoder is used with LazySimpleRetriever, which is important for ensuring compatibility with the implementation.
2779-2792
: LGTM! Creation of the LazySimpleRetriever with proper parameters.This correctly creates a new LazySimpleRetriever instance with all the necessary parameters from the model. The parameter list matches what's expected based on the signature and maintains consistency with the SimpleRetriever pattern.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm fine with this change. I think it is a great and simple implementation for a feature that might seem complex at first.
I just added a couple of comments to understand some stuff but I'm not blocking the release on these
airbyte_cdk/sources/declarative/partition_routers/substream_partition_router.py
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (2)
airbyte_cdk/sources/declarative/retrievers/simple_retriever.py (2)
654-654
: Consider type annotation or wrapper for records_generator_fn.The type ignore comment
# type: ignore[call-arg]
suggests there's a type compatibility issue withrecords_generator_fn(response)
.Would it make sense to add a wrapper function with proper type annotations or a more specific type for
records_generator_fn
to avoid the need for type ignore? wdyt?
672-702
: Good implementation of pagination with some type improvements possible.The
_paginate
method effectively handles pagination by fetching subsequent pages. A few suggestions:
- Consider using a more specific type annotation for
next_page_token
(e.g.,Optional[Mapping[str, Any]]
) instead ofAny
to match the rest of the codebase.- The same type ignore comment appears on line 686. Consider addressing both instances together.
Otherwise, this implementation looks solid with proper pagination termination conditions.
- def _paginate( - self, - next_page_token: Any, + def _paginate( + self, + next_page_token: Optional[Mapping[str, Any]],
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py
(8 hunks)airbyte_cdk/sources/declarative/retrievers/simple_retriever.py
(4 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py
⏰ Context from checks skipped due to timeout of 90000ms (1)
- GitHub Check: Analyze (python)
🔇 Additional comments (6)
airbyte_cdk/sources/declarative/retrievers/simple_retriever.py (6)
9-19
: LGTM: Import organization enhanced for new features.The import organization has been improved to include necessary types for the new LazySimpleRetriever functionality. The explicit listing of imports from typing module improves readability.
22-22
: Good addition of deprecated decorator support.Adding the
deprecated
import from typing_extensions will be used to mark experimental classes. This is good practice for API evolution.
42-42
: LGTM: Adding ExperimentalClassWarning for proper deprecation.The import of
ExperimentalClassWarning
will be used with the deprecated decorator - good practice for marking experimental features.
453-453
: LGTM: Passing stream_slice to parse_records via partial.Correctly passing the stream_slice to the record generator function ensures proper context for record parsing.
659-659
: Check for None before calling _next_page_token.According to a previous review comment,
_next_page_token
expects arequests.Response
but may receiveNone
. Consider adding a guard clause:#!/bin/bash # Check if SimpleRetriever._next_page_token handles None responses rg -A 5 "def _next_page_token.*response: .*Response" --glob "*.py"
635-670
:❓ Verification inconclusive
Check recursion safety in else branch of _read_pages.
The implementation of
_read_pages
provides a good lazy reading approach, but I'm concerned about line 670 where we recursively call_read_pages
with identical parameters ifresponse
is None. This appears to risk infinite recursion.Are you sure this is the intended behavior? Perhaps this should fall back to the parent implementation with
super()._read_pages()
instead? wdyt?
🏁 Script executed:
#!/bin/bash # Check if any tests verify this potentially recursive case rg -A 5 -B 5 "LazySimpleRetriever.*_read_pages.*response.*None" --glob "*.py"Length of output: 79
Ensure Recursion Safety in LazySimpleRetriever._read_pages
I noticed that whenstream_slice.extra_fields["child_response"]
isNone
, the method recursively calls itself with the same arguments, which might cause infinite recursion if the state doesn’t eventually change. Did you intend for it to fall back to the parent's implementation by callingsuper()._read_pages(...)
or include additional termination logic to break the recursion? Can you verify this behavior—perhaps by adding a test case—to ensure it doesn’t lead to an unintended infinite loop? wdyt?
/autofix
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (5)
airbyte_cdk/sources/declarative/declarative_component_schema.yaml (1)
2893-2902
: Review of thelazy_read_pointer
PropertyThe new
lazy_read_pointer
property is a neat addition that aligns well with the PR objectives to enable lazy read functionality. I have a couple of friendly questions/suggestions to consider:
Items Schema Formatting: I noticed that the items for the array are defined using a list notation (
- type: string
). Typically, for a homogeneous array, we would define it as an object like this:items: type: stringWould you consider switching to this format for clarity and consistency with the rest of the schema? wdyt?
Downstream Compatibility: Have you verified that consumers of the schema properly handle an empty array as the default value and that the lazy reading behavior integrates as expected? It might be worth confirming through a few tests if that isn’t already covered.
Overall, the property looks well-defined and seems to provide the intended functionality. I'm looking forward to your thoughts on these points.
airbyte_cdk/sources/declarative/partition_routers/substream_partition_router.py (3)
219-219
: Small typo in comment.There's a minor spelling error in the comment.
- # lazy_read_pointer type handeled in __post_init__ of parent_stream_config + # lazy_read_pointer type handled in __post_init__ of parent_stream_configwdyt?
246-246
: Small typo in comment.There's a minor spelling error in the comment.
- return _create_response(dpath.get(parent_record, path, default=[])) # type: ignore # argunet will be a MutableMapping, given input data structure + return _create_response(dpath.get(parent_record, path, default=[])) # type: ignore # argument will be a MutableMapping, given input data structurewdyt?
421-432
: Consider adding more documentation for the SafeResponse class.The SafeResponse class is a clever implementation that extends requests.Response, but it could benefit from a more descriptive comment about its purpose and usage.
+class SafeResponse(requests.Response): + """A safe wrapper around requests.Response that handles both string and bytes content. + + This class simplifies response handling by: + 1. Allowing content to be set directly as either string or bytes + 2. Providing default attribute access via __getattr__ + 3. Supporting the interface expected by decoders that process HTTP responses + """ -class SafeResponse(requests.Response): def __getattr__(self, name: str) -> Any: return getattr(requests.Response, name, None)wdyt?
airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (1)
2777-2777
: Fix typo in error message.There's a small typo in the error message.
- raise ValueError(f"Found more that one slice per parent. LazySimpleRetriever only supports single slice read for stream - {name}.") + raise ValueError(f"Found more than one slice per parent. LazySimpleRetriever only supports single slice read for stream - {name}.")wdyt?
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
airbyte_cdk/sources/declarative/declarative_component_schema.yaml
(1 hunks)airbyte_cdk/sources/declarative/models/declarative_component_schema.py
(1 hunks)airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py
(8 hunks)airbyte_cdk/sources/declarative/partition_routers/substream_partition_router.py
(6 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- airbyte_cdk/sources/declarative/models/declarative_component_schema.py
🧰 Additional context used
🪛 GitHub Actions: Linters
airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py
[error] 1-1: Ruff formatting check failed. 1 file would be reformatted. Run 'ruff format' to fix code style issues in this file.
⏰ Context from checks skipped due to timeout of 90000ms (9)
- GitHub Check: Check: 'source-pokeapi' (skip=false)
- GitHub Check: Check: 'source-amplitude' (skip=false)
- GitHub Check: Check: 'source-shopify' (skip=false)
- GitHub Check: Check: 'source-hardcoded-records' (skip=false)
- GitHub Check: Pytest (All, Python 3.11, Ubuntu)
- GitHub Check: Pytest (All, Python 3.10, Ubuntu)
- GitHub Check: Pytest (Fast)
- GitHub Check: SDM Docker Image Build
- GitHub Check: Analyze (python)
🔇 Additional comments (9)
airbyte_cdk/sources/declarative/partition_routers/substream_partition_router.py (4)
53-53
: Nice addition of the lazy_read_pointer configuration parameter!This optional list parameter enables the new lazy reading functionality for parent-child stream relationships. The type annotation is clear and consistent with other parameters in this class.
67-76
: Clean implementation of lazy_read_pointer processing.The post-init handling correctly converts string paths to InterpolatedString objects, maintaining consistency with how other configuration options are processed in this class.
215-223
: Good implementation of the lazy reading feature.This logic efficiently identifies when to use lazy reading and extracts child responses from the parent record. The approach of adding the response to extra_fields as "child_response" makes it easily accessible downstream.
233-247
: Well-structured implementation of child response extraction.This helper method cleanly implements the extraction logic with good parameter typing. The inner function for response creation helps keep the code DRY.
airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (5)
441-441
: Correctly import the new LazySimpleRetriever class.This import makes the new retriever class available to the factory for instantiation.
2537-2540
: Good validation of lazy_read_pointer wildcards.This validation prevents potentially problematic use of wildcards in lazy read paths, providing a clear error message to guide users.
2542-2545
: Clean handling of lazy_read_pointer in parent stream config.This code properly processes the lazy_read_pointer into a consistent list format, following the pattern used for other similar configurations.
2761-2785
: Well-structured conditions for LazySimpleRetriever usage.The code clearly establishes when to use LazySimpleRetriever, with appropriate validations for parent-child stream relationships and incremental sync compatibility.
2786-2799
: Good implementation of LazySimpleRetriever instantiation.The code passes all the necessary parameters to LazySimpleRetriever, maintaining consistency with the regular SimpleRetriever instantiation pattern.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (1)
airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (1)
2763-2801
: Core implementation of LazySimpleRetriever conditional logicThis section implements the decision logic for when to use LazySimpleRetriever, with appropriate validation checks for:
- SubstreamPartitionRouter configuration
- State management
- Compatible cursor types
- Decoder types
Small typo in error message
There's a small typo in the error message at line 2779: "more that one" should be "more than one" - wdyt?
- f"Found more that one slice per parent. LazySimpleRetriever only supports single slice read for stream - {name}." + f"Found more than one slice per parent. LazySimpleRetriever only supports single slice read for stream - {name}."
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py
(8 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (9)
- GitHub Check: Check: 'source-pokeapi' (skip=false)
- GitHub Check: Check: 'source-amplitude' (skip=false)
- GitHub Check: Check: 'source-shopify' (skip=false)
- GitHub Check: Check: 'source-hardcoded-records' (skip=false)
- GitHub Check: Pytest (All, Python 3.11, Ubuntu)
- GitHub Check: Pytest (All, Python 3.10, Ubuntu)
- GitHub Check: Pytest (Fast)
- GitHub Check: SDM Docker Image Build
- GitHub Check: Analyze (python)
🔇 Additional comments (4)
airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (4)
441-441
: Import added to support lazy read functionalityGood addition, this import is required for the new lazy read implementation.
1700-1705
: Clean type signature update for incremental_sync parameterThe new parameter allows the retriever to access cursor information needed for lazy reading. The type hints are well-defined, making it clear what model types are supported.
2538-2546
: Good validation for lazy_read_pointer wildcardsProper validation to ensure users don't use wildcards in lazy_read_pointer paths. The error message clearly explains what's supported.
2557-2557
: Added lazy_read_pointer parameter to ParentStreamConfigThis change properly passes the lazy_read_pointer to the constructor, enabling the lazy reading functionality.
…m:airbytehq/airbyte-python-cdk into lazebnyi/add-lazy-read-to-simple-retriver
/autofix
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (4)
airbyte_cdk/sources/declarative/partition_routers/substream_partition_router.py (4)
53-53
: Document the newlazy_read_pointer
attribute in the class docstring.This new attribute appears to be for the lazy reading feature, but it's not described in the class docstring. Would you consider updating the docstring to include information about what this attribute does and how it's used? This would help users understand the purpose and usage of this new feature.
233-247
: Consider adding error handling for missing paths in the parent record.The method correctly extracts child records based on lazy pointers, but what if the specified path doesn't exist in the parent record? The current implementation with
default=[]
handles missing paths well, but it might be worth adding a debug log similar to what you have in_extract_extra_fields
to track when paths are missing, wdyt?
246-246
: Fix typo in the type ignore comment.There's a typo in your comment: "argunet" should be "argument".
- return _create_response(dpath.get(parent_record, path, default=[])) # type: ignore # argunet will be a MutableMapping, given input data structure + return _create_response(dpath.get(parent_record, path, default=[])) # type: ignore # argument will be a MutableMapping, given input data structure
428-430
: Consider modifying the getattr implementation to raise AttributeError for missing attributes.Currently, your implementation returns None for attributes that don't exist in requests.Response, which might hide bugs if code tries to access a non-existent attribute. Would it be better to raise an AttributeError for attributes that don't exist, which is the standard Python behavior? wdyt?
- def __getattr__(self, name: str) -> Any: - return getattr(requests.Response, name, None) + def __getattr__(self, name: str) -> Any: + try: + return getattr(requests.Response, name) + except AttributeError: + raise AttributeError(f"'{self.__class__.__name__}' object has no attribute '{name}'")
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
airbyte_cdk/sources/declarative/partition_routers/substream_partition_router.py
(6 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (9)
- GitHub Check: Publish SDM to DockerHub
- GitHub Check: Check: 'source-pokeapi' (skip=false)
- GitHub Check: Check: 'source-amplitude' (skip=false)
- GitHub Check: Check: 'source-shopify' (skip=false)
- GitHub Check: Check: 'source-hardcoded-records' (skip=false)
- GitHub Check: Pytest (All, Python 3.11, Ubuntu)
- GitHub Check: Pytest (Fast)
- GitHub Check: Pytest (All, Python 3.10, Ubuntu)
- GitHub Check: SDM Docker Image Build
🔇 Additional comments (4)
airbyte_cdk/sources/declarative/partition_routers/substream_partition_router.py (4)
67-76
: LGTM! Nice addition of lazy_read_pointer initialization.The implementation correctly handles both string and InterpolatedString instances, similar to how extra_fields is processed.
215-223
: Good implementation of the lazy reading feature.The conditional check for lazy_read_pointer and subsequent extraction of child response is well-integrated with the existing code flow. I like how you're keeping the child response in the extra_fields, making it accessible to downstream components.
421-426
: Great docstring for the SafeResponse class.This docstring clearly explains the purpose of the class. Thanks for adding this documentation as suggested in the previous review!
431-437
: LGTM! Good implementation of the content property.The property implementation correctly handles both string and bytes content, ensuring proper encoding.
Fixed: https://github.com/airbytehq/airbyte-internal-issues/issues/11473?issue=airbytehq%7Cairbyte-internal-issues%7C11942
This PR introduces
LazySimpleRetriever
, built on top ofSimpleRetriever
with a newlazy_read_pointer
field. When this field is set, the retriever switches to a lazy read mode. This means all full refresh syncs and incremental initial reads will follow the lazy flow, reducing unnecessary child requests and improving performance — especially for streams with limited child data exposure (e.g., Stripebank_accounts
).Summary by CodeRabbit
Summary by CodeRabbit
New Features
LazySimpleRetriever
class for improved data retrieval capabilities.Bug Fixes
Documentation